Comment: How Should Indirect Evidence Be Used?

Robert E. Kass 
Abstract. Indirect evidence is crucial for successful statistical practice. Sometimes, however, it is better used informally. Future efforts should be directed toward understanding better the connection between statistical methods and scientific problems.

Key words and phrases: Bayesian, decision theory, prior information, 
statistical pragmatism, statistical science. 



When Brad Efron speaks about statistical theory and methods we should pay attention. In his talk, as he prefers to call it, he returns to a theme that has surfaced in previous ruminations: his unease with the foundations of statistics and his feeling that there is something missing. In this version he highlights indirect evidence as the aspect of statistical reasoning in need of the theory he yearns for.

The framework of statistical decision theory was created over 50 years ago for small, well-defined problems. Efron seeks an extension to accommodate large datasets where individual observations bear an uncertain relationship to one another. He seems to think such an extension is possible and important for the future of the discipline. Perhaps he is right but, I'm sorry to say, I don't get it. In trying to understand the role of indirect evidence I would examine not theoretical foundations but, instead, the relationship of statistical methodology to scientific inference in the context of specific applications.

Efron begins by citing clinical trials as furnishing "direct evidence" about a question of interest. It is easy to see what he means, but the stereotypical problem in a clinical trial is somewhat special because all the relevant background knowledge has been focused on producing a simple treatment comparison, a comparison that statistical inference will evaluate in a final declarative step. Clinical trials are aimed at treatment policy, so decision theory is highly relevant. In particular, the concepts of type I and type II error have an unusual immediacy because decisions about patients must be made across a large population.

In the scientific applications I am familiar with, statistical inferences are important, even crucial, but they constitute intermediate steps in a chain of inferences, and they are relatively crude. As Jeffreys pointed out long ago, inferences may be based on estimates and standard errors, and they typically need to be accurate only to first order. Similarly, in using the bootstrap we can get by with a fairly small number of observations from the bootstrap distribution because simulation uncertainty quickly becomes smaller than statistical uncertainty. Furthermore, statistical uncertainty is typically smaller than the unquantified aggregate of the many other uncertainties in a scientific investigation. I tell my students in neurobiology that in claiming statistical significance I get nervous unless the p-value is much smaller than 0.01, and if some refinement of an estimate or p-value changes a conclusion, that indeterminacy itself becomes the story. To be convincing, the science needs solid statistical results, but in the end only a qualitative summary is likely to survive. For instance, in Olson et al. (2000), my first publication involving analysis of neural data, more than a dozen different statistical analyses (some of them pretty meticulous, involving both bootstrap and MCMC) were reduced to the main message that among 84 neurons recorded from the supplementary eye field, "Activity reflecting the direction of the [eye movement] developed more rapidly following spatial than following pattern cues." The statistical details reported in the paper were important to the process, but not for the formulation of the basic finding. Such settings seem to me vastly different from that conceptualized by decision theory. In judging the role of statistical analysis within the general scientific enterprise, I prefer Fisher and Jeffreys to Neyman and Savage.
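A small simulation makes the bootstrap point concrete. The sketch below (Python, standard library only) uses hypothetical normal data, not data from any study mentioned here. It reruns a B = 200 bootstrap estimate of the standard error of a mean twenty times and compares the run-to-run spread (simulation uncertainty) with the standard error itself (statistical uncertainty). The helper `bootstrap_se` and all settings are illustrative assumptions.

```python
import random
import statistics


def bootstrap_se(data, stat, B, rng):
    """Bootstrap standard error of `stat`, using B resamples drawn with `rng`."""
    n = len(data)
    # Each replicate applies `stat` to a resample of size n drawn with replacement.
    reps = [stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    return statistics.stdev(reps)


# Hypothetical data: 50 draws from N(10, 2^2), for illustration only.
gen = random.Random(0)
data = [gen.gauss(10.0, 2.0) for _ in range(50)]

# Repeat the whole bootstrap 20 times with B = 200 to see how much the
# SE estimate itself varies from run to run (simulation uncertainty).
se_runs = [bootstrap_se(data, statistics.mean, 200, random.Random(seed)) for seed in range(20)]
se_mean = statistics.mean(se_runs)
mc_sd = statistics.stdev(se_runs)

# Textbook standard error of the mean, s / sqrt(n), for comparison.
plugin_se = statistics.stdev(data) / len(data) ** 0.5

print(f"bootstrap SE ~ {se_mean:.3f} (run-to-run SD {mc_sd:.3f}); s/sqrt(n) = {plugin_se:.3f}")
```

Under these assumptions the run-to-run spread is typically only a few percent of the standard error, in line with the rough rule that, for near-normal data, the Monte Carlo coefficient of variation of a bootstrap SE is about 1/sqrt(2B).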

If science is such a loose and messy process, and inferences so rough and approximate, where does all the statistical effort go? In my view, Jeffreys got it right. State-of-the-art analyses may take months, but they usually come down to estimates and standard errors. The biggest news in the early 1990s was the development, understanding, and propagation of MCMC, which has had an enormous influence on statistical practice. The "Bayesian revolution," however, in my view, is a misnomer. The most important method in Bayesian inference is what Fisher called the method of maximum likelihood. Most of the time what those people running Markov chains are doing is, essentially, computing MLEs. The "revolution" is really a maximum likelihood/Bayesian synthesis based on EM and Gibbs sampling, and their generalizations. It has shown the power of the insights articulated by Fisher and Jeffreys. (With only a bit of a stretch, Dirichlet processes and their relatives may be included as extensions of the basic ideas.) What has advanced over the years is the complexity of the problems we are able to attack, not the fundamental framework.
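To illustrate the claim that running a Markov chain often amounts to computing an MLE and its standard error, here is a minimal random-walk Metropolis sketch for a hypothetical binomial proportion (600 successes in 1,000 trials) under a flat prior. The data, step size, burn-in, and function names are assumptions made for illustration. With this much data the mean and standard deviation of the draws should land close to the MLE 0.6 and its usual standard error sqrt(0.6 × 0.4 / 1000) ≈ 0.0155.

```python
import math
import random
import statistics


def log_post(p, k, n):
    """Log posterior (up to a constant) for a binomial proportion under a flat prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def metropolis(k, n, steps=20000, step_sd=0.03, seed=1):
    """Random-walk Metropolis draws of p; the first 10% are discarded as burn-in."""
    rng = random.Random(seed)
    p = 0.5
    lp = log_post(p, k, n)
    draws = []
    for _ in range(steps):
        prop = p + rng.gauss(0.0, step_sd)
        lp_prop = log_post(prop, k, n)
        diff = lp_prop - lp
        # Accept with probability min(1, posterior ratio); -inf proposals are never accepted.
        if diff >= 0 or rng.random() < math.exp(diff):
            p, lp = prop, lp_prop
        draws.append(p)
    return draws[steps // 10:]


draws = metropolis(k=600, n=1000)
post_mean = statistics.mean(draws)
post_sd = statistics.stdev(draws)

mle = 600 / 1000
mle_se = math.sqrt(mle * (1 - mle) / 1000)
print(f"posterior mean {post_mean:.4f} vs MLE {mle:.4f}; posterior SD {post_sd:.4f} vs SE {mle_se:.4f}")
```

A conjugate Beta posterior would give the same answer in closed form; the chain is used here only because it is the computation the paragraph has in mind.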

Data analytic methods comprise both data manipulation (including estimates and standard errors) and interpretation. Manipulation involves the mechanics of statistical inference, interpretation its logic. If I am reading him correctly, Efron seems to be concerned primarily with the latter. To exemplify the kind of "difficult new problems" he has in mind, Efron uses a hypothetical issue in applying FDR to neuroimaging: half-brain versus whole-brain analysis. When fMRI first hit the scene, almost 20 years ago, a statistician told me of psychologists who were doing many thousands of voxelwise t-tests simultaneously. The standard method was to line up the test statistics in ascending order of magnitude, or descending order of p-value, and to pick a threshold that gave them suitable results. In our statisticians' naivety, we shook our heads with indignation. (I was so much older then.) Then FDR came along and provided precisely the same method of data manipulation, but furnished a new interpretation. And it is a wonderful interpretation, very helpful. I think we all appreciate it. However, as its chief accomplishment is to bless the procedure psychologists were already using (but feeling uncomfortable about, due to problems in controlling family-wise error rate), it is hardly surprising that they like it. I am not by any means an expert in neuroimaging, let alone in diffusion tensor imaging, but I am dubious about the scientific importance of half-brain versus whole-brain FDR. I would guess the bigger issues involve connectivity across voxels and the hazards of warping brains from different individuals algorithmically so that their voxels are aligned. I should think a more pressing problem would be to devise within-subject expressions of uncertainty about white matter fibers in regions of potential interest, and a method of combining such things across subjects, within groups. (Apparently initial steps in getting local DTI uncertainty have been taken by Zhu et al., 2007, and by Efron's former student Armin Schwartzman, 2007, whom he cites.)
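For readers who want the mechanics behind the FDR remark, the Benjamini-Hochberg step-up rule can be sketched in a few lines: sort the m p-values, find the largest rank k with p_(k) ≤ kq/m, and reject the k smallest. It is literally a "line them up and pick a threshold" procedure. The p-values and the split into a "half" and a "whole" family below are hypothetical, chosen only to show that the same test can be declared significant in the smaller family and not in the larger one; this is not Efron's analysis.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(pvals)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under the line rank * q / m.
        if pvals[i] <= rank * q / m:
            k_max = rank
    return set(order[:k_max])


# Hypothetical p-values: the first five play the role of one "half" of the data.
whole = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
half = whole[:5]

print("whole family rejects:", sorted(benjamini_hochberg(whole)))
print("half family rejects: ", sorted(benjamini_hochberg(half)))
```

Here the three p-values between 0.039 and 0.042 are rejected when only the five tests in the hypothetical half are considered, but not when all ten are, because the threshold kq/m depends on m.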

In picking on this example I should acknowledge that everyone who discusses statistical methods per se abstracts away from details of the scientific problem; Fisher and Jeffreys did so, too, and it is unavoidable. I just do not yet understand the logical difficulty Efron is concerned about. While I certainly agree that the use of indirect evidence is a major challenge, especially in dealing with large datasets, it seems to me that with the passage of time our existing logical frameworks are treating us remarkably well. Nor do I see any problem with being Bayesian in one analysis and frequentist in another, or even combining the two in a single swoop. The heyday of decision theory referenced by Efron occurred during a time that emphasized pure theory in many parts of academic life. Now we are in a much more utilitarian period and many of us are content to use whatever seems best suited for the task in front of us. As I have argued elsewhere (Kass, 2010), I believe a straightforward philosophy I have called statistical pragmatism can incorporate both Bayesian and frequentist inference.

It is tempting to try to formalize the many aspects of direct and indirect evidence that must get weighed together, and it is possible to do so Bayesianly. Like Efron, however, I am wary. In Kass (1983) I commented on a very nice but ambitious paper by DuMouchel and Harris in which they used a Bayesian hierarchical model to combine evidence about cancer across species:

The Bayesian approach has its difficulties, for while it is surely desirable to express [knowledge] explicitly, in particular through models, it is often difficult to do so accurately. Lurking beside each analysis are the interrelated dangers of oversimplification, overstated precision, and neglect of beliefs other than the analyst's.

Where I may disagree with Efron is that I do not think it is likely to be fruitful to try some other formalization. The problem in such situations is not inadequacy of logic, but rather the unclear relevance of the related evidence. As I said in Kass (1983), I would not want to apply formal methods in the absence of pretty solid theoretical or empirical knowledge.
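How much a hierarchical model leans on related evidence can be seen in its simplest form. The sketch below is a normal-normal model with a fixed between-group standard deviation `tau`; it is not the DuMouchel and Harris model, and the function name and numbers are hypothetical. Each group's direct estimate `y[i]` (standard error `se[i]`) is pulled toward a precision-weighted common mean, and `tau` alone decides how much indirect evidence is borrowed: at `tau = 0` every group receives the pooled value, while a very large `tau` leaves the direct estimates essentially untouched.

```python
def partial_pool(y, se, tau):
    """Posterior means in a normal-normal model with known se[i] and fixed tau.

    Model (assumed for illustration): y[i] ~ N(theta[i], se[i]**2),
    theta[i] ~ N(mu, tau**2), with mu set to its precision-weighted estimate.
    """
    # Marginally y[i] ~ N(mu, se[i]**2 + tau**2); weight each group by that precision.
    w = [1.0 / (s**2 + tau**2) for s in se]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Shrinkage factor: share of each posterior mean taken from the group's own data.
    b = [tau**2 / (tau**2 + s**2) for s in se]
    return mu, [bi * yi + (1.0 - bi) * mu for bi, yi in zip(b, y)]


# Hypothetical direct estimates for three related settings and their standard errors.
y = [1.0, 2.0, 6.0]
se = [1.0, 1.0, 2.0]
for tau in (0.0, 1.0, 1e6):
    mu, est = partial_pool(y, se, tau)
    print(f"tau={tau:g}: mu={mu:.3f}, estimates={[round(e, 3) for e in est]}")
```

In practice `tau` would be estimated or given a prior, and that choice is where judgments about the relevance of the related evidence enter the model.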

In tackling the complexities of real-life science, real-life clinical trials, or real-life policy decisions, statisticians can bring unique insight based on statistical expertise combined with nontrivial experience in the substantive area. They then exercise good sense as they go along. My statistical bioinformatics colleague Kathryn Roeder put this well recently when she told me, "I violate type I error all the time. And do you know why? I actually want to find those genes!" As Emery Brown and I emphasized in a recent article (Brown and Kass, 2009), this requires new attitudes about training. It also requires an altered notion of our relationship to our collaborators: as Brown and I said, we should put to rest their characterization (used here by Efron) as "clients" and, instead, agree to share responsibility for all aspects of scientific inference, not just statistical ones. In attempting to understand the anatomical basis of dyslexia, of course it matters which part of the brain we focus on, but the choice cannot be made in terms of abstract statistical arguments. It should result from closely knit statistical, neuroimaging, neuroanatomical, and psychological judgment.

Now, I am pretty confident that Efron will agree about this. I bring it up because we judge statistical methods by the two rather different standards of theoretical performance (evaluated either by mathematics or by simulation studies) and apparent effectiveness in answering an applied question. I find it impossible to think about either one without considering the other, and failure on either front serves to veto further contemplation.

I understand Efron's "indirect evidence" to include anything that could, in principle, be used to help formulate a prior for a Bayesian analysis. My impulse is to come at indirect evidence from an applied perspective, and I think an uneasiness much like Efron's motivated me in 1990 to begin organizing the workshop series Case Studies in Bayesian Statistics. I had the lofty goal of identifying and describing key steps in using scientific and technological knowledge to build good Bayesian models and priors, so as to help turn the art of Bayesian statistical practice into a science. The idea was to gain understanding of statistical effectiveness by examining methods carefully in an applied context, and I pointed to Mosteller and Wallace (1964) as the archetype. However, I must admit that while the workshops have been very successful as meetings, they never made much progress on the big agenda. The reason was simply that the audience was too diverse scientifically, so that speakers could not get very far into the details of connecting statistics to science that I originally had in mind. In 2002 Emery Brown and I began a series of meetings, Statistical Analysis of Neural Data, which are broader statistically but, due to their narrower scientific focus, may actually be more successful in providing material for learning about statistical methods.

I have been negative about comprehensive Bayesian analyses, yet I have spent much time and effort trying to understand and promote Bayesian methods. In many circumstances Bayesian methods are great, and very hard to beat. The nonparametric regression method BARS, for example (DiMatteo, Genovese and Kass, 2001), began with existing frequentist and Bayesian results on free-knot splines and used reversible-jump MCMC to great advantage; it was difficult to code properly and takes a long time to run on even modestly sized datasets, but I have not seen another general method produce smaller mean-squared error and more accurate coverage probabilities, and I would be surprised to find an alternative that works much better for the problem we designed BARS to solve, namely Poisson regression with smoothly varying means, which is suitable for fitting neural firing rate intensity functions. BARS illustrates a general truism: we may expect Bayes to work well if there is solid knowledge about the problem that can lead to useful formalization, if one is willing to spend the time it takes to be careful, and if one has the computing resources to get the job done. These are big "ifs." The challenge of indirect evidence is to figure out when they are satisfied.